[metrics] add block analyzer queuelength metrics #577
Conversation
Force-pushed from 34865fa to 83c67c6
Thanks Andy!
cmd/analyzer/analyzer.go
Outdated
if err1 != nil {
	return nil, err1
}
return nodestats.NewAnalyzer(cfg.Analyzers.NodeStats.ItemBasedAnalyzerConfig, sourceClient, emeraldClient, sapphireClient, dbClient, logger)
Thank you for expanding the node stats analyzer!
It's not great that we're hardcoding the dependency on all the runtimes here. Imagine Sapphire becomes unavailable and we still want stats for consensus; or that we want to test nodestats locally or in (nonexistent :/) tests without going through the hassle of setting up access to all runtimes.
Three options I can think of:
1. Add a config flag (list) for specifying which layers to include; this is probably the most "proper" solution, and quite usable too if the default is [consensus, sapphire, emerald]. (A rough sketch of this option follows at the end of this comment.)
2. Decide which layers to include based on the presence/absence of their block analyzers. That's very convenient but leads to implicit dependencies between analyzers / sections of config, which is super ugly.
3. Leave as-is. If e.g. sapphire becomes unavailable, things will continue to work (thanks to lazy initialization of node clients in "Connect to oasis-node lazily" #555); we'll just see a good amount of error spam in logs.
Your call on whether to go with 1 or 3. I think both are justifiable. I'm partially writing them out to check if option 3 holds the way I've described it.
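For concreteness, here's a minimal sketch of what option 1 could look like. The struct, field, and function names below are hypothetical and not the actual nexus config schema:

```go
// Hypothetical sketch of option 1 (names are illustrative, not the real
// nexus config schema): a list in the node_stats config selects which
// layers the analyzer queries.
type NodeStatsConfig struct {
	// Layers the analyzer should fetch heights for; empty means the default set.
	Layers []string `yaml:"layers"`
}

// layersOrDefault applies the suggested default of [consensus, sapphire, emerald].
func layersOrDefault(cfg NodeStatsConfig) []string {
	if len(cfg.Layers) == 0 {
		return []string{"consensus", "sapphire", "emerald"}
	}
	return cfg.Layers
}
```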
Went with option 1!
Some notes for future reference:
The code here creates the consensus/runtime clients for all layers even if they weren't specified in the config file. The LazyGrpcConnect function mentioned above lets us connect to oasis-node only when needed. However, when we instantiate a new runtime client we also make a connections.SDKConnect call a few lines above here, which looks like it could fail eagerly. In practice, the underlying connection code only establishes the connection and checks the chainContext; it does not fetch any runtime-specific info. In almost all cases, the default RPC node specified in the config file will pass this check successfully.

Thus the only failure case should be when a runtime node is explicitly specified and is down. In that case, the failure is immediate and obvious, and can be circumvented by either a) restoring the node or b) removing the problematic layer from the node-stats config list.
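For illustration only, here is a generic sketch of the lazy-connect pattern referenced above; this is not the actual LazyGrpcConnect implementation, and the type and method names are assumptions:

```go
import (
	"sync"

	"google.golang.org/grpc"
)

// lazyConn defers the gRPC dial until the connection is first needed, so a
// misconfigured or down node only surfaces an error when actually queried.
type lazyConn struct {
	dial func() (*grpc.ClientConn, error)
	once sync.Once
	conn *grpc.ClientConn
	err  error
}

// Get dials at most once and caches the result (or the error).
func (c *lazyConn) Get() (*grpc.ClientConn, error) {
	c.once.Do(func() {
		c.conn, c.err = c.dial()
	})
	return c.conn, c.err
}
```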
Force-pushed from 83c67c6 to 81ff26a
Force-pushed from 3a832f7 to 816e08a
Thank you! There are one or two not-entirely-trivial comment threads still unresolved, but I expect smooth sailing from here on out. I'm LGTMing now as the holidays will make it harder to sync.
It looks like you might also have to be careful with the rebase now that #572 is merged? Hopefully not 🤞
Force-pushed from 48e2d9d to 15b3db5
Force-pushed from 48e50fc to d83c474
Force-pushed from d83c474 to cf37b92
Task:
Expose 'queueLength' metric equivalent in block analyzers
For more context, see https://github.com/oasisprotocol/nexus/pull/511/files#r1318043060
It'd be nice to report the difference between the indexed block height and the node block height in Prometheus. We already have both heights stored by the node_stats analyzer.

This PR

To enable the queueLength metric for runtime block analyzers, nexus needs the current chain heights of the runtimes. The node_stats analyzer currently fetches/stores this data only for consensus, so this PR also adds runtime support to the node_stats analyzer.

Note: We could avoid this by fetching the chain heights directly in the block analyzers. Although the height would be fresher, it would also require a second round-trip in the main block processing loop. Since the metrics are internal-only, I opted to use the chain heights we already store in chain.latest_node_heights. Open to suggestions though.

Alternatively, we could move the metric updates into a concurrent loop separate from the main block processing loop. Not sure if it's worth the extra complexity since slow-sync is, by name, expected to be slower.
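As a rough sketch of the metric update described above (the metric name and helper below are hypothetical, not necessarily what this PR uses), a per-layer Prometheus gauge could be set to the gap between the node height stored in chain.latest_node_heights and the analyzer's latest indexed height:

```go
import "github.com/prometheus/client_golang/prometheus"

// Hypothetical gauge; the actual metric name and labels in this PR may differ.
var blockAnalyzerQueueLength = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "block_analyzer_queue_length",
		Help: "Blocks between the node's latest height and the latest indexed height, per layer.",
	},
	[]string{"layer"},
)

func init() {
	prometheus.MustRegister(blockAnalyzerQueueLength)
}

// updateQueueLength is a hypothetical helper: nodeHeight comes from
// chain.latest_node_heights (maintained by the node_stats analyzer);
// indexedHeight is the block analyzer's latest processed height.
func updateQueueLength(layer string, nodeHeight, indexedHeight int64) {
	blockAnalyzerQueueLength.WithLabelValues(layer).Set(float64(nodeHeight - indexedHeight))
}
```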